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Human neurodevelopment is a highly regulated biological pro¬ 
cess. In this article, we study the dynamic changes of neurodevelop¬ 
ment through the analysis of human brain microarray data, sampled 
from 16 brain regions in 15 time periods of neurodevelopment. We 
develop a two-step inferential procedure to identify expressed and un¬ 
expressed genes and to detect differentially expressed genes between 
adjacent time periods. Markov Random Field (MRF) models are used 
to efficiently utilize the information embedded in brain region sim¬ 
ilarity and temporal dependency in our approach. We develop and 
implement a Monte Carlo expectation-maximization (MCEM) algo¬ 
rithm to estimate the model parameters. Simulation studies suggest 
that our approach achieves lower misclassification error and poten¬ 
tial gain in power compared with models not incorporating spatial 
similarity and temporal dependency. 

1. Introduction. Human neurodevelopment is a dynamic and highly reg¬ 
ulated biological process. Abnormalities in neurodevelopment may lead to 
psychiatric and neurological disorders, such as Autism Spectrum Disor¬ 
ders (ASD) [Geschwind and Levitt (2007), Walsh, Morrow and Rubenstein 
(2008), Sestan et al. (2012)]. The statistical methodology developed in this 
paper was motivated by our interest in studying human brain development 
using a microarray gene expression data set, which was collected from 1340 
tissue samples of 57 developing and adult post-mortem brains (including 39 
with both hemispheres) [Johnson et al. (2009), Kang et al. (2011)]. These 57 
post-mortem brains spanned from embryonic development to late adulthood. 
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Table 1 

The 15-period, system in Kang et al. (2011). M, postnatal 
months; PCW, post-conceptional weeks; Y, postnatal years 


Period 

Description 

Age 

1 

Embryonic 

4 PCW < Age < 8 PCW 

2 

Early fetal 

8 PCW < Age < 10 PCW 

3 

Early fetal 

10 PCW < Age < 13 PCW 

4 

Early mid-fetal 

13 PCW < Age < 16 PCW 

5 

Early mid-fetal 

16 PCW < Age < 19 PCW 

6 

Late mid-fetal 

19 PCW < Age < 24 PCW 

7 

Late fetal 

24 PCW < Age < 38 PCW 

8 Neonatal and early infancy 0 M (birth) < Age < 6 M 

9 

Late infancy 

6 M < Age < 12 M 

10 

Early childhood 

1 Y < Age < 6 Y 

11 Middle and late childhood 6 Y < Age < 12 Y 

12 

Adolescence 

12 Y < Age < 20 Y 

13 

Young adulthood 

20 Y < Age < 40 Y 

14 

Middle adulthood 

40 Y < Age < 60 Y 

15 

Late adulthood 

Age > 60 Y 


A 15-period system, demonstrated in Table 1, was defined to represent dis¬ 
tinct stages of brain development [Johnson et al. (2009), Kang et al. (2011)]. 
Except for periods 1 and 2, tissue samples from 16 brain regions were col¬ 
lected from both hemispheres in each brain, including the cerebellar cortex 
(CBC), mediodorsal nucleus of the thalamus (MD), striatum (STR), amyg¬ 
dala (AMY), hippocampus (HIP) and 11 areas of the neocortex, including 
the orbital prefrontal cortex (OFC), dorsolateral prefrontal cortex (DFC), 
ventrolateral prefrontal cortex (VFC), medial prefrontal cortex (MFC), pri¬ 
mary motor cortex (MIC), primary somatosensory cortex (SIC), posterior 
inferior parietal cortex (IPC), primary auditory cortex (A1C), posterior su¬ 
perior temporal cortex (STC), inferior temporal cortex (ITC) and the pri¬ 
mary visual cortex (VIC) [Johnson et al. (2009), Kang et al. (2011)]. Details 
on the brain regions are described in the supplementary material Section 1 
[Lin et al. (2015)]. 

The goal of our analysis is to characterize human neurodevelopment 
through the dynamics of gene expression, such as the identification of ex¬ 
pressed and unexpressed genes, and differentially expressed (DE) genes over 
time in each brain region. The unique challenge presented for statistical 
analysis of this data set is the appropriate modeling and analysis of the 
spatial-temporal structure. For gene expression data with only temporal 
structure (e.g., time course gene expression data), various methods have 
been proposed to model the temporal dependency to better identify DE 
genes. However, as far as we know, none of the existing methods utilizes the 
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information embedded in the spatial similarity between brain regions, as 
indicated by the high correlation in gene expression levels between brain re¬ 
gions in the same period [supplementary material Section 2, Lin et al. (2015) 
and Kang et al. (2011)]. For time course gene expression data, the existing 
methods can be classified into two broad categories: (1) methods that iden¬ 
tify DE genes between multiple biological conditions [Storey et al. (2005), 
Hong and Li (2006), Tai and Speed (2006), Yuan and Kendziorski (2006)]; 
and (2) methods that identify DE genes over time in one biological condi¬ 
tion [Storey et al. (2005), Tai and Speed (2006), Wu et al. (2007), Liu and 
Yang (2009)]. Statistical models that have been proposed to incorporate the 
temporal structure include Hidden Markov Models [Yuan and Kendziorski 
(2006), Wu et al. (2007)], functional models using basis function expansions 
[Storey et al. (2005), Hong and Li (2006), Wu et al. (2007)], function prin¬ 
cipal component analysis [Liu and Yang (2009)] and multivariate empirical 
Bayes models [Tai and Speed (2006)]. 

To efficiently capitalize on brain region similarity and temporal depen¬ 
dency, we propose a two-step Markov Random Field (MRF)-based approach 
to answer the following two biological questions: 1. Which genes are ex¬ 
pressed/unexpressed in each period and in each brain region? 2. Which 
genes are differentially expressed over time in each brain region? We note 
that MRF models have been used to model dependency in genomics data, 
such as neighboring genes defined by biological pathways [Li, Wei and Li 
(2010), Chen, Cho and Zhao (2011), Wei and Li (2007, 2008)] and marker 
dependencies defined by linkage disequilibrium [Li, Wei and Maris (2010)]. 
Across all the brain regions and time periods, the histogram of the ob¬ 
served gene expression levels has a bimodal distribution, where the two 
components likely represent expressed and unexpressed genes [supplemen¬ 
tary material Section 4, Lin et al. (2015) and Kang et al. (2011)]. In this 
paper, we first use a Gaussian mixture model-based approach to identify 
the unexpressed and expressed genes. The model fit and the robustness of 
the Gaussian mixture model are discussed in the supplementary material 
Section 4 [Lin et al. (2015)]. We note that an “unexpressed” gene does not 
necessarily suggest that there is no mRNA molecules of that gene in the 
cell, but rather the gene’s expression level is very low and the observed vari¬ 
ation in the expression values may be mostly due to noise in the microarray 
experiment. In the second step, our methodology utilizes the local false dis¬ 
covery rate (f.d.r.) framework [Efron (2004)] to identify DE genes between 
adjacent time periods. We propose an efficient Monte Carlo expectation- 
maximization (MCEM) algorithm [Wei and Tanner (1990)] to estimate the 
model parameters and a Gibbs sampler to estimate the posterior probabili¬ 
ties. 

The key feature of our approach is to simultaneously consider spatial 
similarity and temporal dependency of gene expression levels to better ex¬ 
tract biologically meaningful results from the data. We introduce the MRF 
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model in Section 2 and present the Monte Carlo expectation-maximization 
(MCEM) algorithm for statistical inference in Section 3. We also present 
the posterior probability estimation and the FDR controlling procedure in 
Section 3. In Section 4 we apply our method to analyze the human brain 
microarray data reported in Kang et al. (2011). Results from simulation 
studies are summarized in Section 5. We conclude the paper with a brief 
discussion in Section 6. 

2. Statistical models and methods. 

2.1. Biological question 1: Identify expressed and unexpressed genes. 

2.1.1. Gaussian mixture model for microarray data. In our human brain 
microarray data, expression levels were measured for G = 17,568 genes on the 
Affymetrix GeneChip Human Exon 1.0 ST Array platform. For quality con¬ 
trol, RMA background correction, quantile normalization, mean probe set 
summarization and log 2 -transformation were performed [Kang et al. (2011)]. 
Details for the quality control procedures are described in the supplemen¬ 
tary material Section 3 [Lin et al. (2015)]. The number of brains that were 
collected varies across time periods and for some brains, tissue samples are 
missing for certain brain regions. So the number of samples varies among 
brain regions and time periods. We treated samples from the same brain 
region and time period as biological replicates. Periods 1 and 2 correspond 
to embryonic and early fetal development, when most of the 16 brain regions 
sampled in future periods have not differentiated (i.e., most of the 16 brain 
regions are missing data in periods 1 and 2). Therefore, samples in periods 1 
and 2 are excluded in our analysis. In total, we consider B = 16 brain regions 
sampled in T = 13 periods of brain development. Let nu denote the num¬ 
ber of replicates for brain region b in period t, = (n^i,... ,nu, ■ ■ ■ ,nbT)' 
is the column vector for the number of replicates for brain region 6, and 
N = (Ni,.... N&,..., N^) is the matrix summarizing the number of repli¬ 
cates across brain regions and periods. The entries in N range from 1 to 16 
and the median is 5. Let ybgtk denote the observed gene expression value for 
gene g in the kth replicate of samples in brain region b and period t, and let 
y b g t = (ybffti, ■ ■ ■ iUbgtn bt ) denote the expression values for all the replicates. 
We assume that ybgtk , for k = 1,..., nu, follows the same normal distribution 
with mean ybgt and standard deviation a q: 

Vbgtk ~ Ni.H’bgt, ^"o ) • 

Let Xbgt be the binary latent state representing whether gene g is expressed 
in brain region b and period t, that is, Xb g t = 1 if the gene is expressed and 
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0 otherwise. Conditioning on Xbgt, we assume that fibgt follows a Gaussian 
distribution: 


Hbgt\%bgt 0 ~ 

Fbgtl^bgt 1 ~ -^"(h'26) *^2b) - 

Marginally, fibgt follows a Gaussian mixture distribution. We assume that 
the mean and the variance for the mixture components are brain region 
specific. Denote by fi 1 , yu 2 , or, <X 2 the vectors of parameters for all brain 
regions. It is easy to see that the distribution of ybgtk conditioning on Xb g t 
has the following form: 

VbgtkVbgt = 0 ~ N(n lb ,(jl b + CTq), 

Vbgtk\%bgt 1 N (/^26? ^26 ^o)* 

Given the latent state array X, conditional independence is assumed: 


B G T 

/(Yix)=nnn f{ybgt\Xbgt), 
6=1 g= 1t=l 

where 

n bt 

f (y bgt\%bgt) j j f (jjbgtkl^bgt) • 

k= 1 


2.1.2. ^4 MRF model for p(X). One key component in the above model 
and the inferential objective is the latent state array X, which is unknown 
to us. Now we discuss how to specify the prior on X, denoted by p(X), 
through a MRF model that takes into account both temporal dependency 
and spatial similarity. For each gene g, we construct an undirected graph 
G g = {Vg,E g }, where Vg = {xbgt ■ b = 1, •.., B , t = 1,..., T} is the set of nodes 
and E g is the set of edges. E g can be divided into two subsets, E g \ and E g 2 , 
where E gl = {(x bg t, x b ' g t') ■ b / b' and t = t'} and E g2 = {(x bg t, Xb> g t>) ■ b = 
b' and \t — t'\ = 1}. E g \ contains the edges capturing spatial similarity be¬ 
tween brain regions and E g 2 contains the edges capturing temporal depen¬ 
dency between adjacent periods. For the joint distribution of p(X), we con¬ 
struct a pairwise interaction MRF model [Besag (1986)] with the following 
form: 

° f 

P(X|S) oc n expj 70 ^ h{x bgt ) +-f 1 1 (x hgt ) 

9 =i 1 v g v g 

“I - Pi ^ ] [-1q (xbgt ) d{) (xj/gt/ ) T Il{xbgt)dl{xb’ gt’f\ 

Egl 


(1) 
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“ 1 “ h ^ ^ [-^0 i.X bg t)Io (•Eb'gt' ) “ 1 “ -^1 {^bgt )-^l (%b'gt' )] 


%2 


where Jo(-) and Ii(-) are the indicator functions. Letting 7 = 71 —70, the con¬ 
ditional probability can be derived (see Appendix for the details of deriva¬ 
tion) : 


(2) 


where 


P(.X bg t , *F) 


exp {x b g t F(x b gt,&)} 

1 + exp{F(x bfft ,$)}’ 


F(x bgt , $) = 7 + /3i '^2(2x blgt - 1) 

V±b 

"F 1] T [‘^• z bg(t+l) 1]}) 

where “/” means other than; $ = (7,/?!,$2) and 7,/3 i ,/02 G R; /3i is the 
parameter capturing the spatial similarity and @2 is the parameter capturing 
the temporal dependency. 


2.2. Biological question 2: Identify DE genes over time. 

2.2.1. A latent state model for DE. For DE analysis, we first transform 
the observed data into an array where the entries are then used in the follow¬ 
up analysis. This is accomplished by performing t-tests between adjacent 
periods and transforming the t-statistics into z-scores. Let yb g (t-i) an d y bgt 
denote the vectors of expression values for gene g in region b and in periods 
t — 1 and t, respectively. The two-sample t-statistic is obtained by 

_ y bgt — Ybgft— 1) 

hg(t-l) “ ~ j 

where s is an estimate of the standard error for y bgt — ywt-i) ■ The test 
statistic t bg u i) is then transformed into z bg u_\y. 

z bg{t- 1) — d 1 (F nbt +n b ( t _ 1 ' ) —2(tbg(t—l)))i 

where n b y_y and n bt are the numbers of replicates in y bg (t-i) and y bg t‘i ^ 
and F nbt+nb ^ t _ 1) -2 are the c.d.f.s for standard normal and t distribution with 
n bt + n b u_i) — 2 degrees of freedom. As a result, the gene expression data are 
represented by a 5 x G x (T — 1) z-score array Z. The entry z bg t represents 
the evidence of DE between periods t and t + 1 for gene g in brain region b. 
Some entries in the array are not assigned values because of the presence of 
unexpressed genes. The variations in the expression values of unexpressed 
genes may be mostly caused by noise in the microarray experiments and we 
do not want to include that noise in identifying DE genes; the transitions 
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from unexpressed to expressed and vice versa are already captured in biolog¬ 
ical question 1. Therefore, no f-test is performed if the gene is unexpressed 
in at least one of the adjacent periods. Let Sbgt denote the binary latent 
state representing whether gene g is differentially expressed in brain region 
b between periods t and t + 1, which is the objective of our inference. Let 
S be the latent state array of dimensions BxGx(T-l). Conditioning on 
Sbgt , we assume that Zb g t follows a mixture distribution: 

/("6</t \sbgt ) — (1 Sbgt) fo^Zbgt) T Sbgtfl (Zbrjl), 

where fo(z) is the null density and f\{z) is the nonnull density. We as¬ 
sume that the null density follows a standard normal A/"(0,1) distribution. 
We adopt the nonparametric empirical Bayesian framework for DE [Efron 
(2004)] by fitting the nonnull density with a natural spline using the R 
package locfdr. Given S, conditional independence is assumed: 

B G T -1 

w)-nnn /{Zbgt \sbgt)- 

6=1 g= 1 t= 1 


2.2.2. A MRF model for p( S). Next, we present a MRF model for the 
prior distribution p{ S), taking into account both temporal dependency and 
spatial similarity. We separate the 16 brain regions into two groups: 11 neo¬ 
cortex regions, represented by B c , and 5 nonneocortex regions, represented 
by B„. The joint probability is similar to (1), except that different spatial 
parameters are assumed for the two groups. The conditional probability can 
be calculated and has the following form: 


( 3 ) 

if b e B c , 


P(Sbgt | Sbgt , ^De) 


exp{sb g tFpE(sbgt, *Lde)} 

1 + exp{T de (sbgt ■ d?DE)} ’ 


^DE {Sbgt, 3?De) = 7DE + P cc E (flSb'gt 1) T P cn E {2Sb'gt - 1) 

6'eB c /6 6'gB„ 

+ /3fU#l[2s& s (t_l) — 1 ] + It^T[2Sbg(t+l) ~ 1 ]}) 


else if b £ B„, 


Fde (sbgt- ^de) = 7de + firm E {^Sb'gt — 1) + Pnc ^2 {2Sb'gt ~ 1) 

b'eB n/h 6'GB c 

+ Pt{It^l[2 s bg(t-l) ~ 1 ] + I#T[2Sbg(t+l) ~ 1 ]}) 

where 3 >de = (Pcc, Pnn, Pen, Pnc), Pec is the between neocortex coefficient, /3 nn 
is the between nonneocortex coefficient, p cn is the neocortex to nonneocortex 
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coefficient, and /3 nc is the nonneocortex to neocortex coefficient. For symme¬ 
try, we assume that /3 cn = (3 nc . In the MRF model in Section 2.1.2, we did 
not separate the brain regions into two groups because the latent states for 
all brain regions were quite similar, which will be shown in Section 4. 

3. Parameter and posterior probability estimation. 


3.1. Parameter estimation for biological question 1: Identify expressed 
and unexpressed genes. In the model, the MRF parameters # = (-y, /3i, /? 2 ) 
and the Gaussian mixture model parameters 0 = (/i l5 aq, /i 2 ) <J 2 ) need to 
be estimated. Given the latent state X, both $ and 0 can be estimated 
by the maximum likelihood estimates (MLE). However, the latent state is 
unobserved and needs to be estimated as well. Although the expectation- 
maximization (EM) algorithm is generally implemented for missing data 
estimation, it is not applicable to our model as the expectation term is not 
tractable. Therefore, we propose the following Monte Carlo EM Algorithm 
[Wei and Tanner (1990)] to estimate $ and 0: 

1. Estimate uq by the unbiased estimator: 


B T n bt 


°o = 


G X Sb=l Xu=l( n &i 1) q=l 6=1 t=L k=l 


Vbgt) ■ 


2. Obtain the initial estimates X and 0 by the simple Gaussian mixture 
model, without considering spatial and temporal dependency. 

3. Because there is no explicit MLE for an initial estimate $ is cho¬ 
sen which maximizes the following pseudolikelihood function Z(X; 3>) [Besag 
(1974)]: 

B G T 

<(*;*)=nnn^i ■^/ &bgt) ^)> 

b=lg=lt=l 


where p(x bg t\X-/x bg t', 3?) is as defined in (2). 

4. Let \F = (<&, 0). The expected complete data log-likelihood in the EM 
algorithm is approximated by the Monte Carlo sum [Wei and Tanner (1990)]: 

1 m 

(4) Q m (¥|¥ (r) ) = -^ln/(Y,xS r) |v|>), 

i=i 


where X^\ ... ,x!(' are obtained by Gibbs sampling. From X^ to X^^, 

I r \ 

all entries in X) are updated, and they are updated sequentially by 
(5) p(x bgt \Y,X/x bgt ;4f < ' r) )(xp(x bgt \X/x b g t -,^ (r) )f(y bgt \x b g t ;@ {r) ). 
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5. Update IF by \F , which maximizes (4): 

^ (r+1) /v (7*) 

\F = argmaxQ m (^ , |^' ). 

Same as in step 3, we replace the likelihood by the pseudolikelihood function 

a . ( t *) 

in QmiV I'f )■ The terms that contain $ and © are separable, therefore, 
they can be optimized separately. 

6 . Repeat steps 4 and 5 until convergence. 

3.2. Parameter estimation for biological question 2: Identify DE genes 
over time. In the model, only the parameters $ in the MRF prior need to 
be updated iteratively. The algorithm shares some similarity with that in 
the previous section: 

1. Pool the z-scores in Z and estimate f\ by the locfdr procedure. 

2. Obtain an initial estimate S by the simple mixture model, without 
considering spatial and temporal dependency. 

3. Obtain an initial estimate $dEi which maximizes the pseudolikelihood 
function: 

B G T -1 

Z(S;$de) = nnn P(.Sbgt | S/ Sbgt ) ^De) j 

b= 1 g= 1 t= 1 

where p(§bgt |S/J& 9 t; <&de) is as defined in (3). 

4. Approximate the expected complete data log-likelihood by the Monte 
Carlo sum: 

1 m 

(6) Q m ($ DE |$DE) = -Vln/(Z,S; r) |^ DE ), 

l=i 

where \ ..., Sm are obtained by Gibbs sampling. From to all 

(r) 

entries in Sj are updated, and they are updated sequentially by 

(1) P( s bgt |Z, S/Sbgt'i ^De) C*-P( s bgt\S / SbgtT f ( z bgt\ s bgt) ■ 

5. Update 3 >de by <F DE , which maximizes (6). 

6 . Repeat steps 4 and 5 until convergence. 

3.3. Posterior probability estimation and FDR controlling procedure. To 
acquire an estimate of the posterior probability, we implement a separate 
Gibbs sampler and keep the model parameters fixed at the estimated values 
by the MCEM algorithm. The latent states in biological questions 1 and 2 
are updated sequentially according to (5) and (7). 
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For the inference of expressed/unexpressed genes, we use 0.5 as the cutoff 
for the posterior probability. For the inference of DE genes, we adapt the 
posterior probability-based definition of FDR [Newton et al. (2001), Li, Wei 
and Maris (2010)]. The posterior local f.d.r. qb g t =p{sb g t = 0|Z) is estimated 
by the Gibbs sampler. Let be the sorted values of qb g t in ascending 
order. Find k = max{f: | ^s=i Q(s) — a } an d reject all the null hypotheses 
iL( s ), for s = 1,..., k. In the analysis of human brain gene expression data, 
we chose a = 0.05. 

4. Application to the human brain microarray data. 

4.1. Identify expressed and unexpressed genes. We first applied the MRF 
model to infer whether a gene is expressed or not in a certain brain region 
and time period. In the parameter estimation, we first ran 20 iterations of 
MCEM by a Gibbs sampler with 500/1500 (1500 iterations in total and 500 
as burn-in), then 20 iterations with 1000/6000 and, finally, 20 iterations with 
1000/10,000. We gradually increased the number of iterations in the Gibbs 
sampler to make the estimate of the parameters more stable. The posterior 
probability was then estimated by a Gibbs sampler with 10,000 iterations 
and 1000 as burn-in. A diagnosis for the number of iterations is presented 
in the supplementary material Section 5 [Lin et al. (2015)]. 

The estimated parameters for the Gaussian mixture model are shown 
in Table 2. The estimated parameters for the MRF prior were 7 = 0.30, 


Table 2 

The estimated parameters for the Gaussian mixture 
model 


Region 

Mi 

M 2 

cr 1 

er 2 

MFC 

4.58 

7.82 

0.59 

1.57 

OFC 

4.57 

7.83 

0.59 

1.58 

VFC 

4.56 

7.84 

0.58 

1.59 

DFC 

4.58 

7.83 

0.58 

1.58 

STC 

4.62 

7.8 

0.58 

1.56 

ITC 

4.61 

7.81 

0.58 

1.57 

A1C 

4.6 

7.82 

0.58 

1.57 

IPC 

4.61 

7.81 

0.58 

1.57 

SIC 

4.61 

7.82 

0.58 

1.58 

MIC 

4.60 

7.82 

0.58 

1.58 

VIC 

4.63 

7.78 

0.59 

1.55 

AMY 

4.65 

7.76 

0.6 

1.52 

HIP 

4.64 

7.77 

0.61 

1.54 

STR 

4.65 

7.78 

0.62 

1.55 

MD 

4.62 

7.81 

0.63 

1.59 

CBC 

4.61 

7.76 

0.65 

1.58 
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Fig. 1. The number of genes that changed from expressed to unexpressed and vice versa 
in adjacent periods. Each line represents a brain region. 


f3\ = 0.22 and /?2 = 6.44. The large coefficient in /?2 indicates strong tem¬ 
poral dependency. Compared with the total number of genes (17,568), only 
a small number of genes changed their latent states between adjacent peri¬ 
ods (Figure 1). The table for the numbers are presented in supplementary 
material Section 8 [Lin et al. (2015)]. For all brain regions, a general trend 
can be observed: the number of genes that changed their latent states first 
increased, peaked in periods 6 to 7, the number in periods 7 to 8 was also 
large, then gradually decreased, starting from periods 12 to 13, fewer than 
15 genes changed their latent states. Period 8 corresponds to birth to 6 post¬ 
natal months. The observation that the changes in gene expression peaked 
from periods 6 to 8 suggests that robust changes in gene expression occurred 
close to birth. 

Moreover, we observed that the latent states for the same gene in all 
brain regions tended to agree with each other. These are summarized in 
Table 3, where we considered all genes by time combinations, that is, GxT = 
17,568 x 13 = 228,384, and counted the number of genes that were expressed 
in a given number of brain regions. Although the MRF prior encourages the 
agreement of latent states, the observation is unlikely driven by the model, 
as we observed a similar trend when the spatial coefficient f5± was fixed to 
be 0 (supplementary material Section 8 [Lin et al. (2015)]). 

Genes that changed states over time may be of biological interest for the 
study of brain development. We conducted Gene Ontology (GO) enrichment 
analysis using DAVID, which takes a list of genes as input and outputs the 
enriched Gene Ontology (GO) terms [Huang et al. (2008), Sherman et al. 
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Table 3 

Summary of the latent states by 
pooling brain regions. “0” represents 
the total count of genes that were 
unexpressed in all brain regions and 
“16” represents the total count of 
genes that were expressed in all brain 
regions 


0 

89,347 

1 

2560 

2 

541 

3 

218 

4 

95 

5 

62 

6 

31 

7 

52 

8 

31 

9 

26 

10 

19 

11 

46 

12 

42 

13 

94 

14 

99 

15 

297 

16 

134,824 


(2009)]. A GO term represents the functional annotation of a list of genes and 
may belong to any of the following three categories: (a) genes that partici¬ 
pate in the same biological process, (b) genes that have the same molecular 
function, and (c) genes that are located in the same cellular component. Only 
GO terms in categories (a) and (b) were included in our analysis, as genes 
located in the same cellular component do not necessarily share similar func¬ 
tions. We observed enrichment of GO terms only from periods 6 to 7 (0.05 
threshold for Bonferroni-adjusted p- value). From periods 6 to 7, genes that 
switched from expressed to unexpressed in all brain regions were enriched for 
“DNA binding” (Bonferroni adjusted p-value = 1.6 x 10~ 9 ), “regulation of 
transcription, DNA-dependent” (Bonferroni adjusted p-value = 2.5 x 10 -4 ) 
and “zinc ion binding” (Bonferroni adjusted p-value = 9.5 x 10~ 5 ); there 
were no enriched GO terms for genes that switched from unexpressed to ex¬ 
pressed. The enrichment of transcription regulation and DNA binding pro¬ 
teins (including zinc-finger proteins coordinated by the binding of zinc ions) 
is consistent with our previous observation that robust changes in transcrip¬ 
tion occurred close to birth. Changes in transcriptional regulation may also 
lead to the peak of differentially expressed genes (see Section 4.2). Details 
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Fig. 2. The number of DE genes identified in each time window of adjacent periods. 
Each line represents a brain region. 


for the GO enrichment analysis are presented in the supplementary material 
Section 6 [Lin et al. (2015)]. 

4.2. Identify DE genes over time. After excluding genes that were unex¬ 
pressed in all brain regions and all periods, 11,370 genes remained. We then 
applied the MRF model to identify DE genes between adjacent periods. The 
settings for the MCEM algorithm and the Gibbs sampler were the same as 
that in the previous section. 

The estimated MRF parameters were 7 de = —0.10, f3 cc = 0.32, /3 nn = 0.53, 
Pen = 0.06, and fit = 0.15. The temporal coefficient fit was much smaller 
compared with that in the previous section (where @2 = 6.44), which suggests 
lower temporal dependency. The neocortex to nonneocortex coefficient /3 cn 
was much smaller than the neocortex to neocortex coefficient /3 CC and the 
nonneocortex to nonneocortex coefficient /3 nn , which indicates the group 
difference between neocortex and nonneocortex regions. 

When no spatial and temporal dependency is assumed, the model reduces 
to a simple empirical Bayesian (EB) model. Based on the posterior FDR 
control procedure described in Section 3, the thresholds in the MRF and 
EB models were 0.26 and 0.12, respectively. The numbers of genes identified 
as DE in the two models were 356,207 (MRF) and 77,330 (EB), with 74,228 
(96%) overlap. The higher threshold led to more genes identified as DE in the 
MRF model. The numbers of DE genes identified are presented in Figure 2, 
where each line represents a brain region. The table of the exact numbers is 
presented in the supplementary material Section 9 [Lin et al. (2015)]. For the 
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Table 4 

Summary for the direction of changes in gene expression by pooling neocortex regions. 
Each row represents a time window. The “0” column represents the counts of genes that 
were down-regulated in all neocortex regions and the “11” column represents the counts 
of genes that were up-regulated in all neocortex regions 



0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

Periods 3-4 

163 

5 

1 

0 

0 

0 

0 

0 

0 

0 

1 

47 

Periods 4-5 

1039 

31 

3 

3 

0 

0 

1 

3 

0 

4 

18 

436 

Periods 5-6 

539 

30 

3 

1 

1 

0 

1 

0 

0 

2 

20 

417 

Periods 6-7 

3475 

28 

3 

2 

1 

1 

2 

2 

0 

2 

29 

1238 

Periods 7-8 

1014 

14 

1 

0 

0 

0 

0 

0 

0 

1 

3 

1640 

Periods 8-9 

387 

5 

0 

0 

0 

0 

0 

0 

0 

0 

1 

146 

Periods 9-10 

1034 

1 

0 

0 

0 

0 

0 

0 

0 

1 

1 

351 

Periods 10-11 

342 

2 

0 

0 

0 

0 

0 

0 

0 

0 

3 

1124 

Periods 11-12 

915 

9 

0 

0 

0 

0 

0 

0 

0 

0 

1 

485 

Periods 12-13 

450 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

204 

Periods 13-14 

263 

5 

0 

0 

0 

0 

0 

0 

0 

0 

2 

39 

Periods 14-15 

107 

22 

0 

0 

0 

0 

0 

0 

0 

0 

5 

149 


number of DE genes, the trend over time was slightly different from that 
in the previous section. In addition to the peak close to birth, there was 
another peak that spanned from early childhood (period 10) to adolescence 
(period 12). The peak was less obvious in the 5 nonneocortex regions (AMY, 
HIP, STR, MD and CBC). During these periods, motor skills, social skills, 
emotional skills and cognitive skills are rapidly developed. The second peak 
may correspond to the development of these essential skills. Genes that 
were DE in the second peak may be of interest to researchers studying these 
behaviors. Note that there was a slight decrease in DE genes in periods 
5-6 compared with that in periods 4-5. The decrease was most obvious 
in brain region STR. Further biological studies are needed to understand 
the trend. We randomly split the data into two subsets and implemented 
the algorithm separately for each subset. Compared with the EB model, 
the genes identified as DE by the MRF model were more likely to overlap: 
56.2% vs. 12.4% (supplementary material Section 9 [Lin et al. (2015)]). The 
information for the direction of changes in gene expression was not utilized 
in the model. However, we observed that DE genes in all neocortex regions 
tended to have the same direction of changes (Table 4). Therefore, the MRF 
model is able to detect consistent changes in gene expression among the 
brain regions, which may be missed by other approaches not considering 
temporal and spatial similarity. 

Autism Spectrum Disorders (ASD) are a group of syndromes character¬ 
ized by fundamental impairments in social reciprocity and language devel¬ 
opment accompanied by highly restrictive interests and/or repetitive behav- 





ANALYSIS OF HUMAN BRAIN DEVELOPMENT 


15 


iors [American Psychiatric Association (2000)]. By exome sequencing, loss 
of function (LoF) mutations with large biological effects have been shown 
to affect ASD risk [Iossifov et al. (2012), Kong et al. (2012), Neale et ah 
(2012), O’Roak et ah (2011, 2012), Sanders et ah (2012)]. A set of nine 
high-confidence ASD risk genes have been identified recently: ANK2, CHD8, 
CUL3, DYRK1A, GRIN2B, KATNAL2, POGZ, SCN2A, TBR1 [Willsey 
et ah (2013)]. These nine genes carry LoF mutations in ASD patients. De¬ 
tails for the genes are described in the supplementary material Section 7 
[Lin et ah (2015)]. Next we analyzed the nine ASD risk genes in the human 
brain gene expression data set. Among the nine genes, KATNAL2 and CHD8 
were unexpressed. The other seven genes were expressed in all brain regions 
and all periods. Gene expression study on postmortem autistic brains and 
structural magnetic resonance imaging studies have highlighted the frontal 
cortex as pathological in ASD patients [Amaral, Schumann and Nordahl 
(2008), Voineagu et ah (2011)]. In the brain gene expression data, five re¬ 
gions were sampled in the frontal cortex: OFC, DFC, VFC, MFC and MIC. 
The gene expression curves for TBR1 and CHD8 are shown in Figure 3. The 
five frontal cortex regions shared similar dynamics for the two genes. TBR1 
was differentially expressed in periods 4-5 and 6-7, while CHD8 remained 
unexpressed. We performed a binomial test to see whether the ASD gene set 
was enriched for DE genes, compared with the overall distribution (Table 5). 
In the binomial test, a gene was counted as DE only if it was DE in all five 
frontal cortex regions. We observed an increased fold change of DE genes in 
the ASD gene set in periods 4-5, 5-6, 6-7, 9-10 and 10-11. It is interesting 
to note the gap that spanned periods 7 to 9, when the ASD genes tended to 
be equally expressed. For periods 4-5 and 9-10, the enrichment was signifi¬ 
cant (<0.05). Period 10 corresponds to early childhood (1 < Age < 6), when 
social, emotional and cognitive skills are observed [Kang et al. (2011)]. The 
most obvious signs of autism tend to emerge between 2 and 3 years of age. 
In periods 9-10, there were four DE genes: SCN2A, CUL3, ANK2, GRIN2B. 
These four genes are of potential interest, as a malfunction of these genes 
in ASD patients may directly affect the development of social and cognitive 
skills in early childhood. 

5. Simulation studies. 

5.1. Identify expressed and unexpressed genes. We conducted simulation 
studies to evaluate the performance of our proposed MRF model. The ex¬ 
pression values for 100 genes in 16 brain regions and 13 periods were simu¬ 
lated. The number of replicates was set to be 3. The latent state array was 
first simulated and we considered two simulation settings: 
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TBR1 



CHD8 



Fig. 3. The dynamics of gene expression for TBR1 and CHD8 in frontal cortex regions. 
In periods f-5 and 6-7, TBR1 was differentially expressed in all frontal cortex regions, as 
indicated by the arrows in the figure. 


Simulation setting 1. The latent state array was simulated by Gibbs 
sampling. The sampler started from a random array with equal probability of 
being expressed or unexpressed. The latent states were updated sequentially 
by (2) and the MRF parameters were set to 7 = 0.08, /?i = 0.20 and /?2 = 1.5. 
We conducted three rounds of Gibbs sampling to obtain the latent state 
array X. 

Simulation setting 2. In period 1, all genes had equal probability of be¬ 
ing unexpressed/expressed. The latent states evolved over time by a Hidden 
Markov Model with 0.1 transition probability. The latent states for the 16 
brain regions were initially set to be the same. Then we let different propor¬ 
tions (0.1,0.2,0.5) of the latent states flip randomly. 
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Table 5 

Enrichment analysis of DE genes in the ASD gene set 



# of DE 
(expected) 

# of DE 
(ASD) 

Fold change 

p- value 

Periods 3-4 

0.3 

0 

0 

0.62 

Periods 4-5 

1.6 

4 

2.5 

0.03 

Periods 5-6 

1.2 

3 

2.5 

0.06 

Periods 6-7 

3.7 

6 

1.6 

0.05 

Periods 7-8 

2.1 

0 

0 

0.96 

Periods 8-9 

0.4 

0 

0 

0.67 

Periods 9-10 

1.0 

4 

3.9 

0.006 

Periods 10-11 

1.1 

2 

1.8 

0.19 

Periods 11-12 

1.1 

1 

0.9 

0.50 

Periods 12-13 

0.6 

0 

0 

0.72 

Periods 13-14 

0.3 

0 

0 

0.64 

Periods 14-15 

0.2 

0 

0 

0.60 


The gene expression levels were simulated based on the latent states. The 
mean gene expression array ft was generated from X by a Gaussian mixture 
model, where ji\ = 4.5, (J\ = 0.75, [12 = (5,5.5,6,6.5,7,7.5,8) and 07 = 1.5. 
We varied fi^ and kept the other parameters unchanged to test the model in 
different scenarios. Parameters were set to be the same for all brain regions. 
The gene expression levels Y were then simulated from a normal distribu¬ 
tion, with mean /1 and variance cTq = 0.25. The MCEM algorithm and the 
Gibbs sampler were implemented the same as in the previous sections. A 
comparison of misclassification rates was made between the MRF model 
and the simple Gaussian mixture model with no temporal and spatial de¬ 
pendency assumed (Table 6). For all simulation settings, the MRF model 
achieved significant improvement in misclassification rates compared with 
the simple Gaussian mixture model. 

5.2. Identify DE genes over time. In the simulation study, data were 
generated for 100 genes, 16 brain regions and 12 periods. We considered 
three simulation settings: 

Simulation setting 1. The latent state array S was updated sequentially 
by (7) and the MRF parameters were set to 7 de = —0.10, (3 CC = 0.31, (3 nn = 
0.52, /3 cn = 0.06 and f3 t = 0.14. To keep the ratio of DE genes roughly the 
same as that in the real data, the sampler started from a random array with 
0.4 probability of being DE. 10% of the genes were then randomly selected 
to be unexpressed in all brain regions from periods 1 to t or t to T = 12, 
where t was randomly picked from 1,...,T. The presence of unexpressed 
genes reflects the fact that a small portion of genes switched their states of 
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Table 6 

Comparison of misclassification rates between the simple Gaussian mixture model 
(GMM) and the MRF model. The standard deviations in 100 independent runs are 
shown in the brackets. The results for simulation settings 1 and 2 are presented, the 
numbers after the model names represent the proportions (0.1,0.2,0.5) of purturbation in 

simulation setting 2 


M2 

GMM 

MRF 

GMM (0.1) 

MRF (0.1) 

5 

0.421 (0.025) 

0.093 (0.008) 

0.426 (0.012) 

0.131 (0.004) 

5.5 

0.346 (0.017) 

0.084 (0.006) 

0.375 (0.013) 

0.11 (0.004) 

6 

0.275 (0.011) 

0.071 (0.005) 

0.31 (0.014) 

0.093 (0.002) 

6.5 

0.203 (0.006) 

0.055 (0.004) 

0.242 (0.009) 

0.083 (0.002) 

7 

0.144 (0.004) 

0.041 (0.003) 

0.185 (0.006) 

0.072 (0.003) 

7.5 

0.101 (0.003) 

0.029 (0.002) 

0.137 (0.004) 

0.053 (0.002) 

8 

0.067 (0.002) 

0.020 (0.001) 

0.096 (0.004) 

0.037 (0.002) 

M2 

GMM (0.2) 

MRF (0.2) 

GMM (0.5) 

MRF (0.5) 

5 

0.423 (0.008) 

0.233 (0.005) 

0.421 (0.004) 

0.344 (0.008) 

5.5 

0.378 (0.011) 

0.208 (0.005) 

0.377 (0.005) 

0.312 (0.011) 

6 

0.31 (0.012) 

0.18 (0.004) 

0.309 (0.004) 

0.261 (0.007) 

6.5 

0.242 (0.009) 

0.144 (0.004) 

0.243 (0.004) 

0.187 (0.004) 

7 

0.185 (0.004) 

0.106 (0.003) 

0.185 (0.004) 

0.133 (0.003) 

7.5 

0.137 (0.004) 

0.075 (0.002) 

0.138 (0.003) 

0.093 (0.002) 

8 

0.096 (0.003) 

0.051 (0.002) 

0.096 (0.003) 

0.060 (0.002) 


unexpressed/expressed in the real data. We conducted three rounds of Gibbs 
sampling to obtain the latent state array S. The z-score array Z was then 
generated from S by a mixture model. For EE, the 2 -score was generated 
from AT( 0,1); for DE, it was generated from AT(—2, 1) or AT(2, 1), with equal 
probability. 

Simulation setting 2. The latent state array S was simulated by Gibbs 
sampling with the same setting as in simulation setting 1. The mean gene 
expression array rl was then generated from S. In period 1, all the genes 
had mean expression values at 0. From period t to t + 1, Rbg(t+ 1) = Rbgt + 
Sbgtd, where 5 ~ Af(0, 1). Finally, the gene expression array Y was generated 
from fi, by Gaussian distribution with variance Uq =0.25 and the number of 
replicates was set to be 3. 

Simulation setting 3. In period 1, all the genes had 0.15 probability of 
being DE. From periods t to t + 1, 70% of the DE genes in period t randomly 
switched to EE, and the same number of EE genes randomly switched to 
DE, to keep the number of DE genes constant over time. To represent the 
neocortex and nonneocortex regions, the first 11 brain regions were set to 
have the same latent states and the other 5 brain regions were set to be 
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the same. Compared with the first 11 brain regions, 40% of the DE genes 
randomly switched to EE in the other 5 brain regions. Then we randomly 
selected different proportions (0.1,0.2,0.5) of the DE states to switch to EE; 
the same number of EE states were randomly selected to switch to DE. 10% 
of the genes were randomly selected to be unexpressed in all brain regions 
as in simulation setting 1. Finally, the z-score array Z was generated in the 
same way as in simulation setting 1. 

The settings for the MCEM algorithm and the Gibbs sampler were the 
same as those in the previous section. We calculated the sensitivity and 
specificity by varying the threshold for the posterior local-f.d.r. We compared 
the proposed MRF model with the empirical Bayesian (EB) model, which 
assumes no temporal and spatial dependency (Figure 4). As the neocortex 
group and the nonneocortex group have different numbers of brain regions 
(11 vs. 5), the ROC curves were plotted separately for the two groups. 
Compared with the EB model, the MRF model performed better in both the 
neocortex and nonneocortex regions. The improvement was more significant 
in the neocortex regions, as there were more brain regions and the MRF 
model benefits more from the spatial similarity. 

6. Conclusions and discussion. The statistical methods developed in this 
paper were motivated from the analysis of human brain development mi¬ 
croarray data. These data represent expression profiles in different brain re¬ 
gions at different developmental stages and they allow us to infer (1) whether 
a gene is expressed or not in a specific brain region in a specific period, and 
(2) whether a gene is differentially expressed between two adjacent periods 
in a specific brain region. To efficiently utilize the spatial similarity between 
brain regions and temporal dependency, we have developed a two-step mod¬ 
eling framework that is based on the Markov Random Field model and local 
FDR methodology to facilitate statistical inference. Our simulation studies 
suggest that this model has a lower misclassification rate compared with 
commonly used Gaussian mixture models without considering spatial sim¬ 
ilarity and temporal dependency. Simulation results and real data analysis 
also suggest that the proposed model improves the power to identify DE 
genes. 

The analysis of the human brain microarray data by our proposed model 
produces biologically meaningful results. The inferred latent states of “ex¬ 
pressed” or “unexpressed” were similar in all brain regions. The number of 
genes that switched their latent states first increased and peaked at birth, 
then gradually decreased in adulthood. In periods 6-7, the list of genes that 
switched from expressed to unexpressed was enriched for transcriptional 
regulatory genes. For the purpose of identifying DE genes between adjacent 
periods, we observed a similar trend in the number of DE genes. However, 
there was an additional peak in periods that correspond to childhood and 
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(e) Setting 3, neocortex 


(f) Setting 3, nonneocortex 


Fig. 4. The ROC curves comparing the empirical Bayesian (EB) model and the proposed 
MRF model. The curves were averaged over 100 simulations. 


adolescence. These observations reflect the dynamics of the neurodevelop- 
rnent process. We also observed that genes carrying a high risk for neurode¬ 
velopment disorders, such as ASD, tended to be differentially expressed, 
especially during periods when cognitive and social skills were developed. 

We have also proposed and implemented an MCEM algorithm to estimate 
the model parameters and a separate Gibbs sampler to estimate the pos- 
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terior probability. In previous studies, the iterated conditional mode (ICM) 
algorithm was implemented to estimate the MRF parameters [Wei and Li 
(2008), Li, Wei and Maris (2010), Besag (1986)]; however, our simulation 
study suggested that the ICM algorithm may lead to biased parameter esti¬ 
mates (supplementary material Section 10 [Lin et al. (2015)]). One limitation 
of the MCEM algorithm is the high computing cost. Under the current set¬ 
ting for the MCEM algorithm, the computing time for the whole data set 
took ten days (five days for biological question 1 and five days for biological 
question 2) on the Yale Louise high performance cluster (Dell m620 system, 
8 core processor, 48 GB of memory). To accelerate convergence, we started 
the model from the estimation which does not consider the spatial and tem¬ 
poral dependency. Another limitation of the MCEM algorithm is that the 
Monte Carlo sum is an approximation to the expectation and may lead to 
instability in parameter estimation. In the diagnosis of the MCEM algorithm 
(supplementary material Section 5 [Lin et al. (2015)]), we demonstrated that 
our model is robust to unstable parameter estimation. Levine and Casella 
(2001) provided a detailed discussion on the setting of the MCEM algorithm. 


APPENDIX 

We provide details on the derivation of the conditional probability (2) 
from the joint probability (1). 

For t / 1 and f / T, 

p(x bgt = 1| X./x bg t]&) 
p{xbgt = 0| X./x b gt] $) 

_ p{x bgt = 1 ,X./x b gtl&) 
p{x bg t = 0 ,X/x 6gt ;$) 


= exp< 7 i - 70 + /3i ^ 2 [h(x b ' g t) ~ h{x b ' g t)] 

^ b'^b 

T 02 [-U i.X b g(t— 1) ) Io(,X b g(t— 1)) T I\ (•£f>g(t-|-l) ) Io(.X b g(t+l)')} 

= exp<^ 7 + 01 ( 2 x b 'gt - 1) + 02 [2x 6s (i_i) - 1 + 2 x bg{t+1) - 1] k 
^ b'^b > 

p(x bgt = 1|X/ x bg t; $)+p(x bgt = 0|X/x b9i ; $) = 1, so we have 
/ 1!V / ^ exp{F(x fe9 t,$)} 

p(x isi = 1|X/* W ; *) = 1+exp{F(lw ^ )} . 

where 

F(xb g t, 3*) 7 Y /?i ^^(2 x b ig b — 1) + /?2{2xfig(i_i) 1 T 2xf,gp_!i) — 1}. 

b'^b 
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For t = 1 and t = T, the conditional probability can be derived similarly. 
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